Assignment 1

1.1

The second figure is clearly easier to analyze than the first. It shows that people distinguish four groups more easily when the groups differ in hue than when they share a hue and differ only in brightness.

1.2

The first plot is the easiest to read, since the human perception capacity for color hue is 3.1 bits. In contrast, the second plot is the hardest, since the capacity for size is only 2.2 bits. Line length and orientation, which are used in the third plot, have capacities of 2.8 and 3 bits respectively, and the groups in that plot are hard to differentiate as well.
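To put these figures in perspective, a capacity of b bits corresponds to roughly 2^b reliably distinguishable levels. The following is a minimal sketch of that arithmetic, using only the bit values quoted above; the variable names are illustrative and not part of the original analysis.

```r
# Channel capacities quoted in the text, in bits
capacity_bits <- c(hue = 3.1, size = 2.2, line_length = 2.8, orientation = 3.0)

# A capacity of b bits corresponds to about 2^b distinguishable levels
levels_distinguishable <- 2^capacity_bits
print(round(levels_distinguishable, 1))
```

By this rough measure, four groups fit comfortably within the capacity of hue but are close to the limit for size.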

1.3

The problem is that the first figure treats the region as a continuous numeric variable, which is wrong: a region cannot be, for example, 1.5 or 2.5. It can only take the integer values 1, 2 or 3.

In the new figure, the decision boundaries can be identified very quickly. Since the groups are shown in three different colors, preattentive processing is possible.

1.4

It is quite hard to distinguish 27 different types. Combining channels does not add up their individual capacities. This indicates that a figure cannot encode too much information (color, size and shape) at once if a human is to tell the groups apart.
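The figure of 27 comes from crossing three levels of color, size and shape. A small sketch of that count is given here; the level labels are placeholders chosen for illustration.

```r
# Three intervals per channel, as produced by cut_interval(..., 3) in the appendix
channel_levels <- c("low", "mid", "high")

# Every combination of color, size and shape levels
combos <- expand.grid(color = channel_levels,
                      size  = channel_levels,
                      shape = channel_levels)
n_types <- nrow(combos)
print(n_types)

# Bits needed to tell all combinations apart
print(log2(n_types))
```

Telling all combinations apart would take log2(27), about 4.75 bits, which is more than any of the single-channel capacities quoted in section 1.2.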

1.5

We can clearly see the decision boundary, since the feature ‘region’ is represented by three different color hues, which can be processed preattentively. According to Treisman’s theory, red, blue and green are each detected in parallel in their individual feature maps.

1.6

Human perception is poor at differentiating angles. For instance, the grey and yellow slices look similar, but they actually differ by 1.2%. It is preferable to label every slice with its percentage or to use a bar chart instead.

1.7

There is a large gap between the bottom and middle parts of the contour plot, which is misleading. A reader might believe that some points lie in the gap, which does not match the actual data.

Assignment 2

2.1 Load the data set

## # A tibble: 6 x 28
##   Team      League   Won  Lost Runs.per.game HR.per.game    AB  Runs  Hits
##   <chr>     <chr>  <dbl> <dbl>         <dbl>       <dbl> <dbl> <dbl> <dbl>
## 1 Aizona D~ NL        69    93          4.64       1.17   5665   752  1479
## 2 Atlanta ~ NL        68    93          4.03       0.758  5514   649  1404
## 3 Baltimor~ AL        89    73          4.59       1.56   5524   744  1413
## 4 Boston R~ AL        93    69          5.42       1.28   5670   878  1598
## 5 Chicago ~ NL       103    58          4.99       1.24   5503   808  1409
## 6 Chicago ~ AL        78    84          4.23       1.04   5550   686  1428
## # ... with 19 more variables: `2B` <dbl>, `3B` <dbl>, HR <dbl>, RBI <dbl>,
## #   StolenB <dbl>, CaughtS <dbl>, BB <dbl>, SO <dbl>, BAvg <dbl>,
## #   OBP <dbl>, SLG <dbl>, OPS <dbl>, TB <dbl>, GDP <dbl>, HBP <dbl>,
## #   SH <dbl>, SF <dbl>, IBB <dbl>, LOB <dbl>

We can see that the ranges of the variables differ greatly. For example, HR.per.game is close to 1, while Runs is in the hundreds. It is therefore reasonable to scale the variables before performing multidimensional scaling (MDS).
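The effect of the differing ranges on a distance computation can be sketched with the HR.per.game and Runs values of the first three rows printed above. This is only an illustration on three rows; the row labels are placeholders.

```r
# HR.per.game and Runs for the first three rows of the printed tibble
teams <- data.frame(
  HR.per.game = c(1.17, 0.758, 1.56),
  Runs        = c(752, 649, 744),
  row.names   = c("row1", "row2", "row3")
)

raw_dist    <- dist(teams)          # distances on the original scale
runs_only   <- dist(teams["Runs"])  # distances using Runs alone
scaled_dist <- dist(scale(teams))   # distances after centering and scaling

print(raw_dist)
print(runs_only)
print(scaled_dist)
```

On the original scale the distances are almost identical to those computed from Runs alone, so HR.per.game contributes next to nothing. After scale(), both columns have unit variance and contribute comparably.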

2.2 Non-metric MDS with Minkowski distance (p = 2)

Now we use non-metric MDS with the Minkowski distance of order p = 2 to present the data in two dimensions. The points are colored by league.
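The Minkowski distance of order 2 is the ordinary Euclidean distance. A quick check on random data (a hypothetical stand-in, not the baseball data) illustrates this:

```r
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)  # 5 made-up observations with 4 features

d_minkowski <- dist(x, method = "minkowski", p = 2)
d_euclidean <- dist(x, method = "euclidean")

# Compare the numeric values only; the two objects carry different attributes
same_values <- isTRUE(all.equal(as.numeric(d_minkowski), as.numeric(d_euclidean)))
print(same_values)
```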

In our opinion, the teams in the AL are fairly similar to one another. In contrast, the teams in the NL differ more. The scatter plot created by isoMDS shows that the AL teams gather in the top-left corner compared with the NL teams. The MDS component V2 differentiates the leagues better. On V2, the AL teams range from -1 to 4, while the NL teams range from -4 to about 3. The Boston Red Sox are clearly an outlier, far away from the other points.

2.3 Shepard plot for the MDS

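Based on the Shepard plot, the MDS is quite successful: the correlation between the original dissimilarities and the distances in the MDS configuration is 92.6%. The object pairs (17, 1) and (20, 16) appear to be outliers and are the hardest for the MDS to map successfully.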
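The quantities behind a Shepard plot can be sketched on random data as follows. This assumes the MASS package is available; the data are made up, so the numbers are not comparable to the 92.6% above.

```r
library(MASS)

set.seed(42)
x <- scale(matrix(rnorm(60), nrow = 10))  # 10 made-up objects with 6 features
d <- dist(x)

fit <- isoMDS(d, k = 2, trace = FALSE)    # non-metric MDS in two dimensions
sh  <- Shepard(d, fit$points)

# sh$x: original dissimilarities, sh$y: distances in the MDS configuration,
# sh$yf: fitted values of the monotone regression of sh$y on sh$x
print(cor(sh$x, sh$y))
print(fit$stress)
```

A high correlation and a low stress both indicate that the two-dimensional configuration preserves the ordering of the original dissimilarities well.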

2.4 Compare MDS column 2 with other variables

As discussed in section 2.2, MDS V2 differentiates the two leagues better. From the series of scatter plots of V2 against the other numeric features, we find that HR and 3B show the strongest relationship. Here are these plots:

Here is the definition from Wikipedia: “home run (abbreviated HR) is scored when the ball is hit in such a way that the batter is able to circle the bases and reach home safely in one play without any errors being committed by the defensive team in the process”. A triple (3B) is a hit on which the batter safely reaches third base. We understand that both statistics reflect difficult techniques in baseball that can bring extra points. As a result, they are key to each team’s result.
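The variables above were chosen by inspecting the scatter plots. One possible complement, sketched here on made-up data, is to rank the features by their absolute correlation with the MDS coordinate; all names in the sketch are hypothetical.

```r
set.seed(7)
features <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))

# Hypothetical stand-in for an MDS coordinate that is driven mostly by feature b
mds_coord <- 2 * features$b + rnorm(30, sd = 0.1)

# Absolute correlation of every feature with the coordinate, strongest first
abs_cor <- sapply(features, function(col) abs(cor(col, mds_coord)))
ranked  <- sort(abs_cor, decreasing = TRUE)
print(ranked)
```

Correlation only captures linear relationships, so the scatter plots remain useful for spotting anything non-linear.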

Appendix

library(ggplot2)
library(gridExtra)
library(plotly)

#1.1
data <- read.table(file="olive.csv", sep = ",", header = TRUE)
# Refer to columns by name inside aes() instead of data$column
P1.1.1 <- ggplot(data, aes(x=palmitic, y=oleic, color=linolenic)) +
  geom_point() +
  ggtitle("Dependence of Palmitic on Oleic colored by Linolenic") +
  scale_x_continuous(name ="Palmitic") +
  scale_y_continuous(name = "Oleic") +
  scale_colour_continuous(name = "Linolenic")
# Split Linolenic into four equal-width intervals
groups <- cut_interval(data$linolenic,4)
P1.1.2 <- ggplot(data, aes(x=palmitic, y=oleic, color=groups)) +
  geom_point() +
  ggtitle("Dependence of Palmitic on Oleic colored by Linolenic") +
  scale_x_continuous(name ="Palmitic") +
  scale_y_continuous(name = "Oleic") +
  scale_colour_discrete(name = "Linolenic")
P1.1.1
P1.1.2

##1.2
P1.1.2
x <- data$palmitic
y <- data$oleic
# Linolenic groups mapped to point size
ggplot(data, aes(x=x, y=y, size=groups)) +
  geom_point() +
  ggtitle("Dependence of Palmitic on Oleic sized by Linolenic") +
  scale_x_continuous(name ="Palmitic") +
  scale_y_continuous(name = "Oleic") +
  labs(size = "Linolenic")

# Linolenic groups mapped to the orientation of a spoke
ggplot(data, aes(x=x, y=y)) +
  geom_point() +
  geom_spoke(aes(angle=as.numeric(groups), radius = 50)) +
  ggtitle("Dependence of Palmitic on Oleic with orientation by Linolenic") +
  scale_x_continuous(name ="Palmitic") +
  scale_y_continuous(name = "Oleic")

##1.3
# Region treated as a continuous numeric variable (the problematic version)
ggplot(data, aes(x=oleic, y=eicosenoic, colour=Region)) +
  geom_point() +
  ggtitle("Oleic vs Eicosenoic colored by Region") +
  scale_x_continuous(name ="Oleic") +
  scale_y_continuous(name = "Eicosenoic") +
  scale_colour_continuous(name = "Region")

# Region converted to a factor so that it gets a discrete hue scale
color <- factor(data$Region,levels=c(1,2,3),labels=c("1","2","3"))
ggplot(data, aes(x=oleic, y=eicosenoic, color=color)) +
  geom_point() +
  ggtitle("Oleic vs Eicosenoic") +
  scale_x_continuous(name ="Oleic") +
  scale_y_continuous(name = "Eicosenoic") +
  scale_colour_hue(name = "Region")

##1.4
x <- data$oleic
y <- data$eicosenoic
color <- cut_interval(data$linoleic,3)
shape <- cut_interval(data$palmitic ,3)
size <- cut_interval(data$palmitoleic,3)
ggplot(data, aes(x=x, y=y, color=color, size=size, shape=shape)) +
  geom_point()

##1.5
x <- data$oleic
y <- data$eicosenoic
color <- cut_interval(data$Region ,3)
shape <- cut_interval(data$palmitic ,3)
size <- cut_interval(data$palmitoleic ,3)
ggplot(data, aes(x=x, y=y, color=color, size=size, shape=shape)) +
  geom_point()

##1.6
# Share of observations per Area, in percent
area_counts <- table(data$Area)
df <- data.frame(nm = names(area_counts),
                 ncyl = as.numeric(area_counts) / sum(area_counts) * 100)
plot_ly(df, labels=~nm, values=~ncyl) %>% add_pie() %>% layout(showlegend = FALSE)

##1.7
ggplot(data, aes(x=linoleic, y=eicosenoic))+
  stat_density_2d(aes(fill = ..level..), geom = "polygon", colour="white",alpha=0.4 )+
  geom_point()+
  scale_x_continuous(name ="Linoleic") +
  scale_y_continuous(name = "Eicosenoic")

#2.1
library(MASS)
library(plotly)
input_path <- "baseball-2016.xlsx"
ball <-readxl::read_excel(input_path)
head(ball)

# Range (max - min) of every numeric column, to compare their scales
list_range <- sapply(ball[,3:28], function(col) max(col) - min(col))
list_range


#2.2
ball.numeric <- scale(ball[,3:28])
d <- dist(ball.numeric, method = "minkowski", p=2)
res <- isoMDS(d, k=2)
coords <- res$points
coordsMDS <- as.data.frame(coords)
coordsMDS$League <- ball$League
coordsMDS$name=ball$Team
plot_ly(coordsMDS, x=~V1, y=~V2, type="scatter", hovertext=~name, color= ~League, colors = c ("red", "black"))

#2.3
sh <- Shepard(d, coords)
# Original dissimilarities and distances in the MDS configuration
delta <- as.numeric(d)
D <- as.numeric(dist(coords, method = "minkowski", p=2))

# Row (index1) and column (index2) object numbers for each pair in the lower triangle
n <- nrow(coords)
index <- matrix(1:n, nrow=n, ncol=n)
index1 <- as.numeric(index[lower.tri(index)])
index <- matrix(1:n, nrow=n, ncol=n, byrow = TRUE)
index2 <- as.numeric(index[lower.tri(index)])

plot_ly()%>%
  add_markers(x=~delta, y=~D, hoverinfo = 'text',
              text = ~paste('Obj1: ', rownames(ball)[index1],
                            '<br> Obj 2: ', rownames(ball)[index2]))%>%
  # sh$yf holds the fitted values of the monotone regression
  add_lines(x=~sh$x, y=~sh$yf)

cor(sh$x, sh$y)

#2.4
ball.numeric <- as.data.frame(ball.numeric)
all_column <- names(ball.numeric)
#See all plots
for (name in all_column) 
{
  plot(ball.numeric[,name], coordsMDS[,2])
}

#Draw the two plots with the strongest relationship
plot_ly( x=~ball.numeric[,10], y=~coordsMDS[,2], type = "scatter") %>%
  layout( xaxis = list(title = "HR"), yaxis=list(title= "MDS column2"))
plot_ly( x=~ball.numeric[,9], y=~coordsMDS[,2], type = "scatter") %>%
  layout( xaxis = list(title = "3B"), yaxis=list(title= "MDS column2"))